Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.

New File Structure Implementation (#4716) * Created archived folder and moved all workshop notebooks with 0 views to archived * Moved 10 notebooks with 0 views into archived/notebooks * Moved 13 0 view notebooks to archived * Deleted 17 duplicate notebooks * Deleted 17 duplicate notebooks (#4685) * Update SMP v2 notebooks to use latest PyTorch 2.3.1, TSM 2.4.0 release (#4678) * Update SMP v2 notebooks to use latest PT2.3.1-TSM2.4.0 release. * Update SMP v2 shared_scripts * Update minimum sagemaker pysdk version to 2.224 * Updated README, removed broken links and fixed markdown (#4687) * Parsash2 patch 1 (#4690) * tutorials-after-initial-feedback Added descriptive text to make the notebooks stand on their own. * move athena notebook into dedicated folder * renamed athena end2end notebooks * moved pyspark notebook into dedicated directory * minor change: consistent directory naming convention * Added overview, headers, and explantory text Tested the notebook end to end. Added more context for processing jobs and cleaning up. The output is visible in the cells. * Added overview, headers, explanatory text Also added troubleshooting note from further testing. 
* fix directory locations for new notebooks * clear notebook outputs * added integration for ci test results * updated formatting with black-nb * update athena notebook: fix parse predictions * fixed ci integration for pyspark-etl-training notebook --------- Co-authored-by: Janosch Woschitz <jwos@amazon.de> * Archived remaining geospatial example notebooks * Removed geospatial from README.md * New Folder Structure Implementation - Archived remaining geospatial example notebooks (#4691) * Archived remaining geospatial example notebooks * Removed geospatial from README.md * Archived remaining workshop notebooks * Archived outdated example notebooks between 1-90 views * MLflow setup (#4689) * Add SageMaker MLflow examples * Add badges * Add MLflow setup notebook; upgrade SageMaker Python SDK for deployment notebook * Linting * More linting changes --------- Co-authored-by: Bobby Lindsey <bwlind@amazon.com> * feat: Model monitor json support for Explainability and Bias (#4696) * initial commit of Blog content: "using step decorator for bedrock fine tuning" (https://sim.amazon.com/issues/ML-16440) (#4657) * initial commit of using step decorator for bedrock fine tuning * ran black command on the notebook * Added CI badges * Added CI badges * fixed typo in notebook title --------- Co-authored-by: Ashish Rawat <rawataws@amazon.com> Co-authored-by: Zhaoqi <jzhaoqwa@amazon.com> * New folder structure (#4694) * Deleted 17 duplicate notebooks (#4685) * Updated README, removed broken links and fixed markdown (#4687) * New Folder Structure Implementation - Archived remaining geospatial example notebooks (#4691) * Archived remaining geospatial example notebooks * Removed geospatial from README.md * Archived remaining workshop notebooks (#4692) * Archived outdated example notebooks between 1-90 views (#4693) --------- Co-authored-by: jsmul <jsmul@amazon.com> * Revert "New folder structure (#4694)" (#4701) This reverts commit 970d88ee18a217610c5c7005bcedb8330c41b774 due to broken 
blog links * archived 17 notebookswith outdated/redundant funtionality * adding notebook for forecast to canvas workshop (#4704) * adding notebook for forecast to canvas workshop * formatting the notebook using black * Adds notebook for deploying and monitoring llm on sagemaker usin fmeval for evaluation (#4705) Co-authored-by: Brent Friedman <brentfr@amazon.com> * archived 20 notebooks with outdated/redundant functionality * archived 20 notebooks with outdated/redundant funtionality * archived 20 notebooks with outdated/redundant funtionality * archived 21 notebooks with outdated/redundant funtionality * archived 19 notebooks with outdated/redundant funtionality * restored pytorch_multi_model_endpoint back from archived * removed redundant notebooks folder from archived - all notebooks now directly in archived * added new folders for new file structure * added gitkeep files to show folders on github * archived one notebook that was missed * introducing new file structure - part 1 * Update README.md * moved unsorted file back to top level to maintain links * archived recently marked, and removed folder names from file names * new file structure: renamed and moved all evaluated notebooks as of 26 july * new file structure: organized new files and files that still need to be evaluated * Update README.md --------- Co-authored-by: Victor Zhu <viczhu@amazon.com> Co-authored-by: parsash2 <60193914+parsash2@users.noreply.github.com> Co-authored-by: Janosch Woschitz <jwos@amazon.de> Co-authored-by: Bobby Lindsey <bobbywlindsey@users.noreply.github.com> Co-authored-by: Bobby Lindsey <bwlind@amazon.com> Co-authored-by: zicanl-amazon <115581573+zicanl-amazon@users.noreply.github.com> Co-authored-by: ashrawat <ashrawat_atl@yahoo.com> Co-authored-by: Ashish Rawat <rawataws@amazon.com> Co-authored-by: Zhaoqi <jzhaoqwa@amazon.com> Co-authored-by: pro-biswa <bisu@amazon.com> Co-authored-by: brentfriedman725 <97409987+brentfriedman725@users.noreply.github.com> Co-authored-by: Brent 
Friedman <brentfr@amazon.com>
2024-07-26 14:42:50 -07:00
{
"cells": [
{
"cell_type": "markdown",
"id": "236f87ca",
"metadata": {},
"source": [
"# Introduction to JumpStart - Sentence Pair Classification"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "669fa5d4",
"metadata": {},
"source": [
"---\n",
"\n",
"This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n",
"\n",
"![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/introduction_to_amazon_algorithms|jumpstart_sentence_pair_classification|Amazon_JumpStart_Sentence_Pair_Classification.ipynb)\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "b57a74bc",
"metadata": {},
"source": [
"---\n",
"Welcome to Amazon [SageMaker JumpStart](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart.html)! You can use JumpStart to solve many Machine Learning tasks through one-click in SageMaker Studio, or through [SageMaker JumpStart API](https://sagemaker.readthedocs.io/en/stable/overview.html#use-prebuilt-models-with-sagemaker-jumpstart). \n",
"\n",
"In this demo notebook, we demonstrate how to use the JumpStart API for Sentence Pair Classification. Sentence Pair Classification refers to classifying a pair of input sentence to one of the class labels of the training dataset. We demonstrate two use cases of Sentence Pair Classification models:\n",
"\n",
"* How to use a Transformer model pre-trained on English dataset, and fine-tuned on [QNLI](https://www.tensorflow.org/datasets/catalog/glue#glueqnli) dataset, to perform Natural Language Inference.\n",
"* How to fine-tune a pre-trained Transformer model to a custom dataset, and then run inference on the fine-tuned model.\n",
"\n",
"Note: This notebook was tested on ml.t3.medium instance in Amazon SageMaker Studio with Python 3 (Data Science) kernel and in Amazon SageMaker Notebook instance with conda_python3 kernel.\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "b4b73811",
"metadata": {},
"source": [
"1. [Set Up](#1.-Set-Up)\n",
"2. [Select a pre-trained model](#2.-Select-a-pre-trained-model)\n",
"3. [Run inference on the pre-trained model](#3.-Run-inference-on-the-pre-trained-model)\n",
" * [Retrieve JumpStart Artifacts & Deploy an Endpoint](#3.1.-Retrieve-JumpStart-Artifacts-&-Deploy-an-Endpoint)\n",
" * [Example input sentences for inference](#3.2.-Example-input-sentences-for-inference)\n",
" * [Query endpoint and parse response](#3.3.-Query-endpoint-and-parse-response)\n",
" * [Clean up the endpoint](#3.4.-Clean-up-the-endpoint)\n",
"4. [Finetune the pre-trained model on a custom dataset](#4.-Finetune-the-pre-trained-model-on-a-custom-dataset)\n",
" * [Retrieve JumpStart Training artifacts](#4.1.-Retrieve-JumpStart-Training-artifacts)\n",
" * [Set Training parameters](#4.2.-Set-Training-parameters)\n",
" * [Train with Automatic Model Tuning (HPO)](#AMT)\n",
" * [Start Training](#4.4.-Start-Training)\n",
" * [Deploy & run Inference on the fine-tuned model](#4.5.-Deploy-&-run-Inference-on-the-fine-tuned-model)"
]
},
{
"cell_type": "markdown",
"id": "c79f3644",
"metadata": {},
"source": [
"## 1. Set Up\n",
"***\n",
"Before executing the notebook, there are some initial steps required for setup. This notebook requires latest version of sagemaker and ipywidgets.\n",
"***"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4927ae83",
"metadata": {},
"outputs": [],
"source": [
"!pip install 'sagemaker<3.0' ipywidgets --upgrade --quiet"
]
},
{
"cell_type": "markdown",
"id": "77e86efb",
"metadata": {},
"source": [
"---\n",
"\n",
"To train and host on Amazon SageMaker, we need to setup and authenticate the use of AWS services. Here, we use the execution role associated with the current notebook instance as the AWS account role with SageMaker access. It has necessary permissions, including access to your data in S3. \n",
"\n",
"---"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "73573043",
"metadata": {},
"outputs": [],
"source": [
"import sagemaker, boto3, json\n",
"from sagemaker import get_execution_role\n",
"\n",
"aws_role = get_execution_role()\n",
"aws_region = boto3.Session().region_name\n",
"sess = sagemaker.Session()"
]
},
{
"cell_type": "markdown",
"id": "f713508e",
"metadata": {},
"source": [
"## 2. Select a pre-trained model\n",
"***\n",
"You can continue with the default model, or can choose a different model from the dropdown generated upon running the next cell. A complete list of JumpStart models can also be accessed at [JumpStart Models](https://sagemaker.readthedocs.io/en/stable/doc_utils/jumpstart.html#).\n",
"***"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a3c7e6d",
"metadata": {},
"outputs": [],
"source": [
"model_id = \"huggingface-spc-bert-base-uncased\""
]
},
{
"cell_type": "markdown",
"id": "67c39c7b",
"metadata": {},
"source": [
"***\n",
"[Optional] Select a different JumpStart model. Here, we download jumpstart model_manifest file from the jumpstart s3 bucket, filter-out all the Sentence Pair Classification models and select a model.\n",
"***"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "76b5adb5",
"metadata": {},
"outputs": [],
"source": [
"import IPython\n",
"import ipywidgets as widgets\n",
"\n",
"# download JumpStart model_manifest file.\n",
"boto3.client(\"s3\").download_file(\n",
" f\"jumpstart-cache-prod-{aws_region}\", \"models_manifest.json\", \"models_manifest.json\"\n",
")\n",
"with open(\"models_manifest.json\", \"rb\") as json_file:\n",
" model_list = json.load(json_file)\n",
"\n",
"# filter-out all the Sentence Pair Classification models from the manifest list.\n",
"spc_models_all_versions, spc_models = [\n",
" model[\"model_id\"] for model in model_list if \"-spc-\" in model[\"model_id\"]\n",
"], []\n",
"[spc_models.append(model) for model in spc_models_all_versions if model not in spc_models]\n",
"\n",
"# display the model-ids in a dropdown, for user to select a model.\n",
"dropdown = widgets.Dropdown(\n",
" value=model_id,\n",
" options=spc_models,\n",
" description=\"JumpStart Sentence Pair Classification Models:\",\n",
" style={\"description_width\": \"initial\"},\n",
" layout={\"width\": \"max-content\"},\n",
")\n",
"display(IPython.display.Markdown(\"## Select a JumpStart pre-trained model from the dropdown below\"))\n",
"display(dropdown)"
]
},
{
"cell_type": "markdown",
"id": "27039ad1",
"metadata": {},
"source": [
"## 3. Run inference on the pre-trained model\n",
"***\n",
"Using JumpStart, we can perform inference on the pre-trained model, even without fine-tuning it first on a custom dataset. For this example, that means on a pair of input sentences predicting the class label from one of the 2 classes of the [QNLI](https://www.tensorflow.org/datasets/catalog/glue#glueqnli) dataset: entail, no-entail.\n",
"\n",
"***"
]
},
{
"cell_type": "markdown",
"id": "e9186664",
"metadata": {},
"source": [
"### 3.1. Retrieve JumpStart Artifacts & Deploy an Endpoint\n",
"***\n",
"We retrieve the deploy_image_uri, deploy_source_uri, and base_model_uri for the pre-trained model. To host the pre-trained model, we create an instance of [`sagemaker.model.Model`](https://sagemaker.readthedocs.io/en/stable/api/inference/model.html) and deploy it.\n",
"***"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cea1e68e",
"metadata": {},
"outputs": [],
"source": [
"from sagemaker import image_uris, model_uris, script_uris\n",
"from sagemaker.model import Model\n",
"from sagemaker.predictor import Predictor\n",
"from sagemaker.utils import name_from_base\n",
"\n",
"# model_version=\"*\" fetches the latest version of the model.\n",
"infer_model_id, infer_model_version = dropdown.value, \"*\"\n",
"\n",
"endpoint_name = name_from_base(f\"jumpstart-example-{infer_model_id}\")\n",
"\n",
"inference_instance_type = \"ml.m5.xlarge\"\n",
"\n",
"# Retrieve the inference docker container uri.\n",
"deploy_image_uri = image_uris.retrieve(\n",
" region=None,\n",
" framework=None,\n",
" image_scope=\"inference\",\n",
" model_id=infer_model_id,\n",
" model_version=infer_model_version,\n",
" instance_type=inference_instance_type,\n",
")\n",
"# Retrieve the inference script uri.\n",
"deploy_source_uri = script_uris.retrieve(\n",
" model_id=infer_model_id, model_version=infer_model_version, script_scope=\"inference\"\n",
")\n",
"# Retrieve the base model uri.\n",
"base_model_uri = model_uris.retrieve(\n",
" model_id=infer_model_id, model_version=infer_model_version, model_scope=\"inference\"\n",
")\n",
"# Create the SageMaker model instance. Note that we need to pass Predictor class when we deploy model through Model class,\n",
"# for being able to run inference through the SageMaker API.\n",
"model = Model(\n",
" image_uri=deploy_image_uri,\n",
" source_dir=deploy_source_uri,\n",
" model_data=base_model_uri,\n",
" entry_point=\"inference.py\",\n",
" role=aws_role,\n",
" predictor_cls=Predictor,\n",
" name=endpoint_name,\n",
")\n",
"# deploy the Model.\n",
"base_model_predictor = model.deploy(\n",
" initial_instance_count=1,\n",
" instance_type=inference_instance_type,\n",
" endpoint_name=endpoint_name,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "8db81fa4",
"metadata": {},
"source": [
"### 3.2. Example input sentences for inference\n",
"***\n",
"Let's put in some example sentence pairs. You can put in any pairs of sentences, the model will predict whether the second sentence entails the first sentence or not.\n",
"These examples are taken from QNLI dataset downloaded from [TensorFlow](https://www.tensorflow.org/datasets/catalog/glue#glueqnli). [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). [Dataset Homepage](https://rajpurkar.github.io/SQuAD-explorer/). [CC BY-SA 4.0 License](https://creativecommons.org/licenses/by-sa/4.0/legalcode).\n",
"***"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c122353f",
"metadata": {},
"outputs": [],
"source": [
"sentence_pair1 = [\n",
" \"How many octaves does Beyonce have?\",\n",
" \"Beyoncé's vocal range spans four octaves.\",\n",
"]\n",
"sentence_pair2 = [\n",
" \"How many octaves does Beyonce have?\",\n",
" \"While another critic says she is a \"\n",
" \"Vocal acrobat, being able to sing long and complex melismas and vocal runs effortlessly, and in key.\",\n",
"]"
]
},
{
"cell_type": "markdown",
"id": "ec89c2fa",
"metadata": {},
"source": [
"### 3.3. Query endpoint and parse response\n",
"***\n",
"Input to the endpoint is a pair of sentences. Response from the endpoint is a dictionary containing the predicted class label, and a list of class label probabilities.\n",
"***"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b8a1211f",
"metadata": {},
"outputs": [],
"source": [
"newline, bold, unbold = \"\\n\", \"\\033[1m\", \"\\033[0m\"\n",
"\n",
"\n",
"def query_endpoint(encoded_text):\n",
" response = base_model_predictor.predict(\n",
" encoded_text, {\"ContentType\": \"application/list-text\", \"Accept\": \"application/json;verbose\"}\n",
" )\n",
" return response\n",
"\n",
"\n",
"def parse_response(query_response):\n",
" model_predictions = json.loads(query_response)\n",
" probabilities, labels, predicted_label = (\n",
" model_predictions[\"probabilities\"],\n",
" model_predictions[\"labels\"],\n",
" model_predictions[\"predicted_label\"],\n",
" )\n",
" return probabilities, labels, predicted_label\n",
"\n",
"\n",
"for sentence_pair in [sentence_pair1, sentence_pair2]:\n",
" query_response = query_endpoint(json.dumps(sentence_pair).encode(\"utf-8\"))\n",
" probabilities, labels, predicted_label = parse_response(query_response)\n",
" print(\n",
" f\"Inference:{newline}\"\n",
" f\"Input text: '{sentence_pair}'{newline}\"\n",
" f\"Model prediction: {probabilities}{newline}\"\n",
" f\"Labels: {labels}{newline}\"\n",
" f\"Predicted Label: {bold}{predicted_label}{unbold}{newline}\"\n",
" )"
]
},
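{
"cell_type": "markdown",
"id": "sketch-parse-response",
"metadata": {},
"source": [
"***\n",
"The response format described in Section 3.3 can be explored offline, without a deployed endpoint. The following is a minimal sketch and not part of the original notebook: the payload is a hypothetical stand-in for an endpoint response, the label strings `entail` and `no-entail` are borrowed from the class names mentioned in Section 3 and may not match what the endpoint returns, `top_label` is an illustrative helper, and we assume that `labels` and `probabilities` are listed in the same order.\n",
"\n",
"```python\n",
"import json\n",
"\n",
"# Hypothetical verbose response body; real values come from the endpoint.\n",
"sample_response = json.dumps(\n",
"    {\n",
"        'probabilities': [0.93, 0.07],\n",
"        'labels': ['entail', 'no-entail'],\n",
"        'predicted_label': 'entail',\n",
"    }\n",
").encode('utf-8')\n",
"\n",
"\n",
"def top_label(raw_response):\n",
"    # Assumes labels and probabilities are aligned by index.\n",
"    payload = json.loads(raw_response)\n",
"    pairs = zip(payload['labels'], payload['probabilities'])\n",
"    # Return the (label, probability) pair with the highest probability.\n",
"    return max(pairs, key=lambda pair: pair[1])\n",
"\n",
"\n",
"label, probability = top_label(sample_response)\n",
"print(f'{label}: {probability:.2f}')\n",
"```\n",
"\n",
"Keeping each label next to its probability makes it straightforward to apply a confidence threshold before acting on a prediction.\n",
"***"
]
},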
{
"cell_type": "markdown",
"id": "225d0032",
"metadata": {},
"source": [
"### 3.4. Clean up the endpoint"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4a6a4929",
"metadata": {},
"outputs": [],
"source": [
"# Delete the SageMaker endpoint and the attached resources\n",
"base_model_predictor.delete_model()\n",
"base_model_predictor.delete_endpoint()"
]
},
{
"cell_type": "markdown",
"id": "0895a89d",
"metadata": {},
"source": [
"## 4. Finetune the pre-trained model on a custom dataset\n",
"***\n",
"Previously, we saw how to run inference on a pre-trained model, which was fine-tuned on QNLI dataset. Next, we discuss how a model can be finetuned to a custom dataset. \n",
"\n",
"The Text Embedding model can be fine-tuned on any sentence pair \n",
"classification dataset in the same way the model available for inference \n",
"has been fine-tuned on the QNLI dataset.\n",
"The model available for fine-tuning attaches a binary classification layer to the Text Embedding model\n",
"and initializes the layer parameters to random values. The fine-tuning step fine-tunes \n",
"all the model parameters to minimize prediction error on the input data and returns the fine-tuned model.\n",
"The model returned by fine-tuning can be further deployed for inference. Below are the instructions \n",
"for how the training data should be formatted for input to the model. \n",
"\n",
"- **Input:** A directory containing a 'data.csv' file. \n",
" - Each row of the first column of 'data.csv' should have 0/1 integer class labels.\n",
" - Each row of the second column should have the corresponding first sentence. \n",
" - Each row of the third column should have the corresponding second sentence. \n",
"- **Output:** A trained model that can be deployed for inference. \n",
"\n",
"Below is an example of 'data.csv' file showing values in its first three columns. Note that the file should not have any header.\n",
"\n",
"| | | |\n",
"|---|---|---|\n",
"|0\t|What is the Grotto at Notre Dame?\t|Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection.|\n",
"|1\t|What is the Grotto at Notre Dame?\t|It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858.|\n",
"|0\t|What sits on top of the Main Building at Notre Dame?\t|Atop the Main Building's gold dome is a golden statue of the Virgin Mary.|\n",
"|...|...|...|\n",
" \n",
"\n",
"QNLI dataset is downloaded from \n",
"[TensorFlow](https://www.tensorflow.org/datasets/catalog/glue#glueqnli). \n",
"[Apache 2.0 License](https://jumpstart-cache-prod-us-west-2.s3-us-west-2.amazonaws.com/licenses/Apache-License/LICENSE-2.0.txt). \n",
"[Dataset Homepage](https://rajpurkar.github.io/SQuAD-explorer/). \n",
"[CC BY-SA 4.0 License](https://creativecommons.org/licenses/by-sa/4.0/legalcode). \n",
"***"
]
},
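{
"cell_type": "markdown",
"id": "sketch-data-csv",
"metadata": {},
"source": [
"***\n",
"The `data.csv` layout described in Section 4 can be produced and checked with the Python standard library. The following is a minimal sketch and not part of the original notebook: the two rows are shortened, hypothetical examples, `write_data_csv` and `validate_data_csv` are illustrative helper names, and we assume that the training script accepts standard CSV quoting for sentences that contain commas.\n",
"\n",
"```python\n",
"import csv\n",
"import tempfile\n",
"from pathlib import Path\n",
"\n",
"# Hypothetical rows: (label, first sentence, second sentence).\n",
"rows = [\n",
"    (0, 'What is the Grotto at Notre Dame?', 'Immediately behind the basilica is the Grotto.'),\n",
"    (1, 'What is the Grotto at Notre Dame?', 'It is a replica of the grotto at Lourdes, France.'),\n",
"]\n",
"\n",
"\n",
"def write_data_csv(directory, examples):\n",
"    # Write a header-less data.csv with columns: label, sentence 1, sentence 2.\n",
"    path = Path(directory) / 'data.csv'\n",
"    with open(path, 'w', newline='', encoding='utf-8') as f:\n",
"        csv.writer(f).writerows(examples)\n",
"    return path\n",
"\n",
"\n",
"def validate_data_csv(path):\n",
"    # Check the three-column layout and the 0/1 labels; return the row count.\n",
"    count = 0\n",
"    with open(path, newline='', encoding='utf-8') as f:\n",
"        for row in csv.reader(f):\n",
"            assert len(row) == 3, f'expected 3 columns, got {len(row)}'\n",
"            assert row[0] in ('0', '1'), f'label must be 0 or 1, got {row[0]!r}'\n",
"            count += 1\n",
"    return count\n",
"\n",
"\n",
"data_dir = tempfile.mkdtemp()\n",
"csv_path = write_data_csv(data_dir, rows)\n",
"print(validate_data_csv(csv_path))\n",
"```\n",
"\n",
"The directory that holds `data.csv` would then be uploaded to S3 and its S3 path passed as the training channel.\n",
"***"
]
},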
{
"cell_type": "markdown",
"id": "3bdc050d",
"metadata": {},
"source": [
"### 4.1. Retrieve JumpStart Training artifacts\n",
"***\n",
"Here, for the selected model, we retrieve the training docker container, the training algorithm source, the pre-trained model, and a python dictionary of the training hyper-parameters that the algorithm accepts with their default values. Note that the model_version=\"*\" fetches the latest model. Also, we do need to specify the training_instance_type to fetch train_image_uri.\n",
"***"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4285dbf0",
"metadata": {},
"outputs": [],
"source": [
"from sagemaker import image_uris, model_uris, script_uris, hyperparameters\n",
"\n",
"model_id, model_version = dropdown.value, \"*\"\n",
"training_instance_type = \"ml.p3.2xlarge\"\n",
"\n",
"# Retrieve the docker image\n",
"train_image_uri = image_uris.retrieve(\n",
" region=None,\n",
" framework=None,\n",
" model_id=model_id,\n",
" model_version=model_version,\n",
" image_scope=\"training\",\n",
" instance_type=training_instance_type,\n",
")\n",
"# Retrieve the training script\n",
"train_source_uri = script_uris.retrieve(\n",
" model_id=model_id, model_version=model_version, script_scope=\"training\"\n",
")\n",
"# Retrieve the pre-trained model tarball to further fine-tune\n",
"train_model_uri = model_uris.retrieve(\n",
" model_id=model_id, model_version=model_version, model_scope=\"training\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "5cdf0549",
"metadata": {},
"source": [
"### 4.2. Set Training parameters\n",
"***\n",
"Now that we are done with all the setup that is needed, we are ready to fine-tune our Sentence Pair Classification model. To begin, let us create a [``sagemaker.estimator.Estimator``](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) object. This estimator will launch the training job. \n",
"\n",
"There are two kinds of parameters that need to be set for training. \n",
"\n",
"The first one are the parameters for the training job. These include: (i) Training data path. This is S3 folder in which the input data is stored, (ii) Output path: This the s3 folder in which the training output is stored. (iii) Training instance type: This indicates the type of machine on which to run the training. Typically, we use GPU instances for these training. We defined the training instance type above to fetch the correct train_image_uri. \n",
"\n",
"The second set of parameters are algorithm specific training hyper-parameters.\n",
"***"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "126f7100",
"metadata": {},
"outputs": [],
"source": [
"# Sample training data is available in this bucket\n",
"training_data_bucket = f\"jumpstart-cache-prod-{aws_region}\"\n",
"# For a quick demonstration of training we have created a random subset of QNLI dataset.\n",
"# For complete QNLI dataset replace \"QNLI-tiny\" with \"QNLI\" in the line below.\n",
"training_data_prefix = \"training-datasets/QNLI-tiny/\"\n",
"\n",
"training_dataset_s3_path = f\"s3://{training_data_bucket}/{training_data_prefix}\"\n",
"\n",
"output_bucket = sess.default_bucket()\n",
"output_prefix = \"jumpstart-example-spc-training\"\n",
"\n",
"s3_output_location = f\"s3://{output_bucket}/{output_prefix}/output\""
]
},
{
"cell_type": "markdown",
"id": "91068cff",
"metadata": {},
"source": [
"***\n",
"For algorithm specific hyper-parameters, we start by fetching python dictionary of the training hyper-parameters that the algorithm accepts with their default values. This can then be overridden to custom values.\n",
"***"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "20818a89",
"metadata": {},
"outputs": [],
"source": [
"from sagemaker import hyperparameters\n",
"\n",
"# Retrieve the default hyper-parameters for fine-tuning the model\n",
"hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)\n",
"\n",
"# [Optional] Override default hyperparameters with custom values\n",
"hyperparameters[\"batch-size\"] = \"64\"\n",
"print(hyperparameters)"
]
},
{
"cell_type": "markdown",
"id": "4042f56a",
"metadata": {
"collapsed": false
},
"source": [
"### 4.3. Train with Automatic Model Tuning ([HPO](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html)) <a id='AMT'></a>\n",
"***\n",
"Amazon SageMaker automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose. We will use a [HyperparameterTuner](https://sagemaker.readthedocs.io/en/stable/api/training/tuner.html) object to interact with Amazon SageMaker hyperparameter tuning APIs.\n",
"***"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4cd1443a",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"from sagemaker.tuner import ContinuousParameter\n",
"\n",
"# Use AMT for tuning and selecting the best model\n",
"use_amt = True\n",
"\n",
"# Define objective metric per framework, based on which the best model will be selected.\n",
"metric_definitions_per_model = {\n",
" \"tensorflow\": {\n",
" \"metrics\": [{\"Name\": \"val_accuracy\", \"Regex\": \"val_accuracy: ([0-9\\\\.]+)\"}],\n",
" \"type\": \"Maximize\",\n",
" },\n",
" \"huggingface\": {\n",
" \"metrics\": [{\"Name\": \"eval_accuracy\", \"Regex\": \"'eval_accuracy': ([0-9\\\\.]+)\"}],\n",
" \"type\": \"Maximize\",\n",
" },\n",
"}\n",
"\n",
"# You can select from the hyperparameters supported by the model, and configure ranges of values to be searched for training the optimal model.(https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-ranges.html)\n",
"hyperparameter_ranges = {\n",
" \"adam-learning-rate\": ContinuousParameter(0.000001, 0.001, scaling_type=\"Logarithmic\")\n",
"}\n",
"\n",
"# Increase the total number of training jobs run by AMT, for increased accuracy (and training time).\n",
"max_jobs = 6\n",
"# Change parallel training jobs run by AMT to reduce total training time, constrained by your account limits.\n",
"# if max_jobs=max_parallel_jobs then Bayesian search turns to Random.\n",
"max_parallel_jobs = 2"
]
},
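{
"cell_type": "markdown",
"id": "sketch-metric-regex",
"metadata": {},
"source": [
"***\n",
"Each `Regex` in the metric definitions of Section 4.3 is matched against the training job logs to extract the objective metric. The following is a minimal sketch and not part of the original notebook: the pattern is the `huggingface` one from the configuration cell, while the log line is a hypothetical example (the real log format may differ) and `extract_metric` is an illustrative helper.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"# Pattern copied from the 'huggingface' entry of metric_definitions_per_model.\n",
"eval_accuracy_regex = r\"'eval_accuracy': ([0-9\\.]+)\"\n",
"\n",
"# Hypothetical log line; the real training log format may differ.\n",
"log_line = \"{'eval_loss': 0.41, 'eval_accuracy': 0.8725, 'epoch': 3.0}\"\n",
"\n",
"\n",
"def extract_metric(regex, text):\n",
"    # Return the first captured group as a float, or None if nothing matches.\n",
"    match = re.search(regex, text)\n",
"    return float(match.group(1)) if match else None\n",
"\n",
"\n",
"print(extract_metric(eval_accuracy_regex, log_line))\n",
"```\n",
"\n",
"Testing a pattern against a sample log line like this is a cheap way to catch a regex that never matches before launching a tuning job.\n",
"***"
]
},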
{
"cell_type": "markdown",
"id": "fcb6d08c",
"metadata": {},
"source": [
"### 4.4. Start Training\n",
"***\n",
"We start by creating the estimator object with all the required assets and then launch the training job.\n",
"***"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fff5bd5a",
"metadata": {},
"outputs": [],
"source": [
"from sagemaker.estimator import Estimator\n",
"from sagemaker.utils import name_from_base\n",
"from sagemaker.tuner import HyperparameterTuner\n",
"\n",
"training_job_name = name_from_base(f\"jumpstart-example-{model_id}-transfer-learning\")\n",
"\n",
"# Create SageMaker Estimator instance\n",
"spc_estimator = Estimator(\n",
"    role=aws_role,\n",
"    image_uri=train_image_uri,\n",
"    source_dir=train_source_uri,\n",
"    model_uri=train_model_uri,\n",
"    entry_point=\"transfer_learning.py\",\n",
"    instance_count=1,\n",
"    instance_type=training_instance_type,\n",
"    max_run=360000,\n",
"    hyperparameters=hyperparameters,\n",
"    output_path=s3_output_location,\n",
"    base_job_name=training_job_name,\n",
")\n",
"\n",
"if use_amt:\n",
"    metric_definitions = next(\n",
"        value for key, value in metric_definitions_per_model.items() if model_id.startswith(key)\n",
"    )\n",
"\n",
"    hp_tuner = HyperparameterTuner(\n",
"        spc_estimator,\n",
"        metric_definitions[\"metrics\"][0][\"Name\"],\n",
"        hyperparameter_ranges,\n",
"        metric_definitions[\"metrics\"],\n",
"        max_jobs=max_jobs,\n",
"        max_parallel_jobs=max_parallel_jobs,\n",
"        objective_type=metric_definitions[\"type\"],\n",
"        base_tuning_job_name=training_job_name,\n",
"    )\n",
"\n",
"    # Launch a SageMaker Tuning job to search for the best hyperparameters\n",
"    hp_tuner.fit({\"training\": training_dataset_s3_path})\n",
"else:\n",
"    # Launch a SageMaker Training job by passing s3 path of the training data\n",
"    spc_estimator.fit({\"training\": training_dataset_s3_path}, logs=True)"
]
},
{
"cell_type": "markdown",
"id": "1862744a",
"metadata": {},
"source": [
"### 4.5. Deploy & run Inference on the fine-tuned model\n",
"***\n",
"A trained model does nothing on its own. We now want to use the model to perform inference. For this example, that means predicting the class label of an input sentence pair. We follow the same steps as in [3. Run inference on the pre-trained model](#3.-Run-inference-on-the-pre-trained-model). We start by retrieving the JumpStart artifacts for deploying an endpoint. However, instead of `base_predictor`, we deploy the fine-tuned model: the best training job found by `hp_tuner` if `use_amt` is `True`, otherwise the model trained by `spc_estimator`.\n",
"***"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d912eb2",
"metadata": {},
"outputs": [],
"source": [
"inference_instance_type = \"ml.m5.xlarge\"\n",
"\n",
"# Retrieve the inference docker container uri\n",
"deploy_image_uri = image_uris.retrieve(\n",
"    region=None,\n",
"    framework=None,\n",
"    image_scope=\"inference\",\n",
"    model_id=model_id,\n",
"    model_version=model_version,\n",
"    instance_type=inference_instance_type,\n",
")\n",
"# Retrieve the inference script uri\n",
"deploy_source_uri = script_uris.retrieve(\n",
"    model_id=model_id, model_version=model_version, script_scope=\"inference\"\n",
")\n",
"\n",
"endpoint_name = name_from_base(f\"jumpstart-example-FT-{model_id}-\")\n",
"\n",
"# Deploy the fine-tuned model to a SageMaker endpoint: the tuner deploys its best training job,\n",
"# while the estimator deploys the model from its single training job\n",
"finetuned_predictor = (hp_tuner if use_amt else spc_estimator).deploy(\n",
"    initial_instance_count=1,\n",
"    instance_type=inference_instance_type,\n",
"    entry_point=\"inference.py\",\n",
"    image_uri=deploy_image_uri,\n",
"    source_dir=deploy_source_uri,\n",
"    endpoint_name=endpoint_name,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "b64055c0",
"metadata": {},
"source": [
"---\n",
"Let's put in some example sentence pairs. You can put in any pair of sentences; the model predicts whether the second sentence entails the first sentence or not.\n",
"These examples are taken from the QNLI dataset downloaded from [TensorFlow](https://www.tensorflow.org/datasets/catalog/glue#glueqnli). [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). [Dataset Homepage](https://rajpurkar.github.io/SQuAD-explorer/). [CC BY-SA 4.0 License](https://creativecommons.org/licenses/by-sa/4.0/legalcode).\n",
"\n",
"---"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9d14b6ae",
"metadata": {},
"outputs": [],
"source": [
"sentence_pair1 = [\n",
"    \"How many octaves does Beyonce have?\",\n",
"    \"Beyoncé's vocal range spans four octaves.\",\n",
"]\n",
"sentence_pair2 = [\n",
"    \"How many octaves does Beyonce have?\",\n",
"    \"While another critic says she is a \"\n",
"    \"Vocal acrobat, being able to sing long and complex melismas and vocal runs effortlessly, and in key.\",\n",
"]"
]
},
{
"cell_type": "markdown",
"id": "27cbadb9",
"metadata": {},
"source": [
"---\n",
"Next, we query the fine-tuned model, parse the response, and print the predictions.\n",
"\n",
"---"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e4a82c41",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"newline, bold, unbold = \"\\n\", \"\\033[1m\", \"\\033[0m\"\n",
"\n",
"\n",
"def query_endpoint(encoded_text):\n",
"    response = finetuned_predictor.predict(\n",
"        encoded_text, {\"ContentType\": \"application/list-text\", \"Accept\": \"application/json;verbose\"}\n",
"    )\n",
"    return response\n",
"\n",
"\n",
"def parse_response(query_response):\n",
"    model_predictions = json.loads(query_response)\n",
"    probabilities, labels, predicted_label = (\n",
"        model_predictions[\"probabilities\"],\n",
"        model_predictions[\"labels\"],\n",
"        model_predictions[\"predicted_label\"],\n",
"    )\n",
"    return probabilities, labels, predicted_label\n",
"\n",
"\n",
"for sentence_pair in [sentence_pair1, sentence_pair2]:\n",
"    query_response = query_endpoint(json.dumps(sentence_pair).encode(\"utf-8\"))\n",
"    probabilities, labels, predicted_label = parse_response(query_response)\n",
"    print(\n",
"        f\"Inference:{newline}\"\n",
"        f\"Input text: '{sentence_pair}'{newline}\"\n",
"        f\"Model prediction: {probabilities}{newline}\"\n",
"        f\"Labels: {labels}{newline}\"\n",
"        f\"Predicted Label: {bold}{predicted_label}{unbold}{newline}\"\n",
"    )"
]
},
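{
"cell_type": "markdown",
"id": "d27f6b04",
"metadata": {},
"source": [
"---\n",
"To see the response shape that `parse_response` expects without calling a live endpoint, the next cell is a small offline sketch. The keys `probabilities`, `labels`, and `predicted_label` are the ones read above; the payload values, the placeholder label names, and all `sample_` variables are hypothetical and made up for illustration.\n",
"\n",
"---"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3c81e9a5",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"# Hypothetical response body, made up for illustration; a real endpoint returns its own values.\n",
"sample_response = json.dumps(\n",
"    {\n",
"        \"probabilities\": [0.12, 0.88],\n",
"        \"labels\": [\"label_0\", \"label_1\"],\n",
"        \"predicted_label\": \"label_1\",\n",
"    }\n",
").encode(\"utf-8\")\n",
"\n",
"sample_predictions = json.loads(sample_response)\n",
"# Pair each label with its probability for easier reading.\n",
"sample_label_probabilities = dict(\n",
"    zip(sample_predictions[\"labels\"], sample_predictions[\"probabilities\"])\n",
")\n",
"print(sample_label_probabilities)\n",
"print(sample_predictions[\"predicted_label\"])"
]
},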
{
"cell_type": "markdown",
"id": "638ff80c",
"metadata": {},
"source": [
"---\n",
"Next, we clean up the deployed endpoint.\n",
"\n",
"---"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "82d144e8",
"metadata": {},
"outputs": [],
"source": [
"# Delete the SageMaker endpoint and the attached resources\n",
"finetuned_predictor.delete_model()\n",
"finetuned_predictor.delete_endpoint()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "4466ebda",
"metadata": {},
"source": [
"## Notebook CI Test Results\n",
"\n",
"This notebook was tested in multiple regions. The test results are as follows, except for us-west-2, which is shown at the top of the notebook.\n",
"\n",
"![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/introduction_to_amazon_algorithms|jumpstart_sentence_pair_classification|Amazon_JumpStart_Sentence_Pair_Classification.ipynb)\n",
"\n",
"![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/introduction_to_amazon_algorithms|jumpstart_sentence_pair_classification|Amazon_JumpStart_Sentence_Pair_Classification.ipynb)\n",
"\n",
"![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/introduction_to_amazon_algorithms|jumpstart_sentence_pair_classification|Amazon_JumpStart_Sentence_Pair_Classification.ipynb)\n",
"\n",
"![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/introduction_to_amazon_algorithms|jumpstart_sentence_pair_classification|Amazon_JumpStart_Sentence_Pair_Classification.ipynb)\n",
"\n",
"![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/introduction_to_amazon_algorithms|jumpstart_sentence_pair_classification|Amazon_JumpStart_Sentence_Pair_Classification.ipynb)\n",
"\n",
"![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/introduction_to_amazon_algorithms|jumpstart_sentence_pair_classification|Amazon_JumpStart_Sentence_Pair_Classification.ipynb)\n",
"\n",
"![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/introduction_to_amazon_algorithms|jumpstart_sentence_pair_classification|Amazon_JumpStart_Sentence_Pair_Classification.ipynb)\n",
"\n",
"![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/introduction_to_amazon_algorithms|jumpstart_sentence_pair_classification|Amazon_JumpStart_Sentence_Pair_Classification.ipynb)\n",
"\n",
"![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/introduction_to_amazon_algorithms|jumpstart_sentence_pair_classification|Amazon_JumpStart_Sentence_Pair_Classification.ipynb)\n",
"\n",
"![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/introduction_to_amazon_algorithms|jumpstart_sentence_pair_classification|Amazon_JumpStart_Sentence_Pair_Classification.ipynb)\n",
"\n",
"![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/introduction_to_amazon_algorithms|jumpstart_sentence_pair_classification|Amazon_JumpStart_Sentence_Pair_Classification.ipynb)\n",
"\n",
"![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/introduction_to_amazon_algorithms|jumpstart_sentence_pair_classification|Amazon_JumpStart_Sentence_Pair_Classification.ipynb)\n",
"\n",
"![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/introduction_to_amazon_algorithms|jumpstart_sentence_pair_classification|Amazon_JumpStart_Sentence_Pair_Classification.ipynb)\n",
"\n",
"![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/introduction_to_amazon_algorithms|jumpstart_sentence_pair_classification|Amazon_JumpStart_Sentence_Pair_Classification.ipynb)\n",
"\n",
"![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/introduction_to_amazon_algorithms|jumpstart_sentence_pair_classification|Amazon_JumpStart_Sentence_Pair_Classification.ipynb)\n"
]
}
],
"metadata": {
"instance_type": "ml.t3.medium",
"kernelspec": {
"display_name": "Python 3 (Data Science 3.0)",
"language": "python",
"name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}